Abstract

As a Ph.D. student in the sciences, I have often heard that getting a Ph.D. is useless and simply a waste of time because many successful people don’t have one. So I decided to look into the matter and ask some questions:

  1. How much does someone earn with each level and type of degree?
  2. What’s the unemployment rate for each level and type of degree?
  3. Which state has the lowest unemployment rate for each level and type of degree?

To answer these questions, I use the US Census data hosted on Kaggle. A link to the data description and the data dictionary is attached.

The US Census data

Every year, the US Census Bureau runs the American Community Survey, in which about 3.5 million households are asked detailed questions about who they are and how they live. The data come in two parts: Housing and Population. Because my questions concern individual people, I use the population data. The data set, given in the link below, is too big to upload, so I show my work in Kaggle Scripts.

Setup libraries

Libraries I used for the analysis

library("plyr")
library("dplyr")
library("data.table")
library("ggplot2")
library("choroplethr")
library("scales")

Read the data

Here I read the data and keep the columns that interest me:

  • “PINCP”: Person’s total income
  • “SCHL”: School level: 21: Bachelor, 22: Master, 24: Ph.D.
  • “SEX”: 1: Male, 2: Female
  • “ST”: State code
  • “AGEP”: Age from 0 to 99
  • “ESR”: Employment status: 3: Unemployed
  • “SCIENGP”: Degree flag: 1: Science/Engineering fields, 2: Not in S/E fields
  • “SCIENGRLP”: Degree flag 2: 1: S/E related fields, 2: Not in S/E related fields

Science/Engineering degrees are mostly degrees obtained in a science or engineering department.

The S/E related fields include degrees that have core courses in a science department, such as those held by nurses, physicians, and surgeons.

# Set reRead to 1 to rebuild populData.RData from the raw CSV files;
# any other value loads the previously saved copy.
reRead <- 2
if (reRead == 1) {
  colsToKeep <- c("PINCP", "SCHL", "ESR", "SCIENGP", "ST", "SEX", "AGEP", "SCIENGRLP")
  # Read only the needed columns from both halves of the population file
  popDataA <- fread("../input/pums/ss13pusa.csv", select = colsToKeep)
  popDataB <- fread("../input/pums/ss13pusb.csv", select = colsToKeep)
  populData <- rbind(popDataA, popDataB)
  rm(popDataA, popDataB)
  save(populData, file = "populData.RData")
  summary(populData)
  nrow(populData)
  head(populData, 5)
} else {
  load("populData.RData")
}

The SCIENGP and SCIENGRLP columns contain many NAs because not everyone has a Bachelor’s degree or higher, so in the next step I take populData and omit all NAs. I then filter the data by SCHL, selecting only people with Bachelor’s, Master’s, and Ph.D. degrees. Two flags indicate the type of a person’s degree. Since there is only a little overlap between Science/Engineering and Science/Engineering related fields, I combine the two flags into a single column, SciEng, using a simple equation: if someone’s degree is in Science/Engineering, SciEng = 2; if it is in an S/E related field, SciEng = 1; all others have SciEng = 0.
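The filtering and flag mapping can be sketched as follows. This is a minimal illustration in base R on a small made-up table, not the code used for the report. The exact equation behind SciEng is not shown in this write-up, so the rule that S/E takes priority when both flags are set is an assumption.

```r
# Toy stand-in for populData; the rows are made up for illustration.
toy <- data.frame(
  SCHL      = c(16, 21, 21, 22, 24, 21),
  SCIENGP   = c(NA,  1,  2,  2,  1,  2),
  SCIENGRLP = c(NA,  2,  1,  2,  2,  2)
)

# Drop rows with missing flags, then keep Bachelor (21), Master (22), Ph.D. (24).
ds <- na.omit(toy)
ds <- ds[ds$SCHL %in% c(21, 22, 24), ]

# Assumed mapping: S/E (flag value 1 in SCIENGP) wins over S/E related.
ds$SciEng <- ifelse(ds$SCIENGP == 1, 2,
             ifelse(ds$SCIENGRLP == 1, 1, 0))

print(ds)
```

The summary printed next in this report comes from the full data set, not from this toy table.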

##        ST             AGEP            SCHL            SEX       
##  Min.   : 1.00   Min.   :19.00   Min.   :21.00   Min.   :1.000  
##  1st Qu.:12.00   1st Qu.:35.00   1st Qu.:21.00   1st Qu.:1.000  
##  Median :27.00   Median :48.00   Median :21.00   Median :2.000  
##  Mean   :27.45   Mean   :49.06   Mean   :21.43   Mean   :1.532  
##  3rd Qu.:41.00   3rd Qu.:61.00   3rd Qu.:22.00   3rd Qu.:2.000  
##  Max.   :56.00   Max.   :95.00   Max.   :24.00   Max.   :2.000  
##       ESR           PINCP            SCIENGP        SCIENGRLP    
##  Min.   :1.00   Min.   : -12000   Min.   :1.000   Min.   :1.000  
##  1st Qu.:1.00   1st Qu.:  20400   1st Qu.:1.000   1st Qu.:2.000  
##  Median :1.00   Median :  46000   Median :2.000   Median :2.000  
##  Mean   :2.41   Mean   :  63149   Mean   :1.653   Mean   :1.899  
##  3rd Qu.:6.00   3rd Qu.:  80000   3rd Qu.:2.000   3rd Qu.:2.000  
##  Max.   :6.00   Max.   :1276000   Max.   :2.000   Max.   :2.000  
##      SciEng      
##  Min.   :0.0000  
##  1st Qu.:0.0000  
##  Median :0.0000  
##  Mean   :0.7778  
##  3rd Qu.:2.0000  
##  Max.   :2.0000
## [1] 638002

By simply looking at the summary of the data, we can find some interesting results for the group of people who hold a Bachelor’s degree or higher:

  1. The total number of records is 638,002
  2. The average age of this group is 49
  3. The youngest is 19 and the oldest is 95
  4. There are slightly more women than men
  5. The median income is 46,000 and the average income is 63,149
  6. Most people hold a non-Science/Engineering degree

Now I add names to the corresponding columns and set the order of the degree levels to Bachelor, Master, and Doctorate.
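One way to attach such labels is with ordered factor levels, sketched here in base R on a made-up table. The label strings are copied from the output in this report; storing DType as a factor is an assumption (the listing below reports it as a character column).

```r
# Toy rows; the codes follow the data dictionary quoted earlier.
toy <- data.frame(
  SCHL   = c(24, 21, 22, 21),
  SEX    = c(1, 2, 2, 1),
  SciEng = c(2, 0, 1, 2)
)

# The order of `levels` fixes the display order: Bachelor, MS, Ph.D.
toy$DegLevel <- factor(toy$SCHL, levels = c(21, 22, 24),
                       labels = c("Bachelor", "MS", "Ph.D."))
toy$sex <- factor(toy$SEX, levels = c(1, 2),
                  labels = c("Male", "female"))
toy$DType <- factor(toy$SciEng, levels = c(0, 1, 2),
                    labels = c("Other", "S&ERelated", "S&E"))

print(toy)
```

The ten-row listing that follows is taken from the real data, not from this sketch.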

## Source: local data frame [10 x 12]
## Groups: SCHL
## 
##    ST AGEP SCHL SEX ESR  PINCP SCIENGP SCIENGRLP SciEng DegLevel    sex
## 1   1   63   21   2   1  39930       2         1      1 Bachelor female
## 2   1   72   21   1   1  53600       1         2      2 Bachelor   Male
## 3   1   22   21   1   3   2000       2         2      0 Bachelor   Male
## 4   1   22   21   2   3  16000       1         2      2 Bachelor female
## 5   1   52   21   2   1 100000       2         1      1 Bachelor female
## 6   1   72   24   1   6  60800       1         2      2    Ph.D.   Male
## 7   1   72   21   2   6  32800       2         2      0 Bachelor female
## 8   1   84   21   2   6  41000       2         2      0 Bachelor female
## 9   1   85   21   1   6  53500       1         2      2 Bachelor   Male
## 10  1   65   21   2   1  45400       1         2      2 Bachelor female
## Variables not shown: DType (chr)
## [1] 638002

Analyze with single factors

First I simply plot histograms of the different factors within the group of degree holders.

## Source: local data frame [2 x 2]
## 
##      sex      n
## 1   Male 298444
## 2 female 339558

Women make up a slightly larger number of degree holders.

## Source: local data frame [3 x 2]
## 
##   DegLevel      n
## 1 Bachelor 423943
## 2       MS 183182
## 3    Ph.D.  30877

There are a lot of Bachelors and very few Ph.D.s.

## Source: local data frame [3 x 2]
## 
##        DType      n
## 1      Other 359523
## 2        S&E 217752
## 3 S&ERelated  60727

More people have a degree that is not in science/engineering.

The number of degree holders is evenly distributed between ages 25 and 65 and then gradually decreases, as the total number of people declines after age 65. The small increase at the end, at 95, is because the US Census data has an age cutoff at 99.

[Figure: “Age vs type of Degrees”]

From this graph, we can see that there are more women getting into S/E fields.

Analyze with multiple factors

This shows no clear relation between age and degree type. There are some fluctuations, and the counts for all three types of degree holders seem to increase and decrease at the same time.

This graph clearly shows that the first peak for MS and Ph.D. holders lies to the right of the Bachelors’ peak, simply because it takes more years to get an MS or Ph.D.

Is it worth getting a degree?

Now that we have basic statistics from the US Census data, we know that many people have a Bachelor’s degree and that more people hold a non-Science/Engineering degree. But is more always better?

To answer this question, let’s break it into three questions:

  • What salary can I expect with a certain degree?
  • What’s the unemployment rate for different degrees?
  • Where should I find a job?

How much can you earn with a certain degree (BSc, MS, Ph.D., Science/Engineering)?

Here I filter the income data, keeping only incomes larger than 1000.

Here we find many outliers who earn 10 times more than others, so I then try a log scale.
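To see why a log scale helps with such outliers, here is a small base R sketch. The income values are the minimum, quartiles, and maximum of PINCP from the summary earlier in this report, plus one made-up value (500) to exercise the filter. The variable names are hypothetical.

```r
# PINCP summary values from this report, plus a made-up 500.
income <- c(-12000, 500, 20400, 46000, 80000, 1276000)

# Keep only incomes larger than 1000, as described above.
kept <- income[income > 1000]

# On the raw scale the largest value dwarfs the rest;
# on the log10 scale the same values span less than two units.
rawRatio  <- max(kept) / min(kept)
logIncome <- log10(kept)

print(kept)
print(rawRatio)
print(round(logIncome, 2))
```

On a plot, the same effect is obtained by putting the income axis on a log scale instead of transforming the values by hand.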

Here I take Degree type into consideration.

In this density plot, MS and Ph.D. holders have a similar peak around $65,000. Bachelors’ incomes are evenly distributed between $20,000 and $50,000 and fall off faster than those of MS and Ph.D. holders.

In the last graph about income, all three types of degree follow the same pattern. One difference I noticed is that if you hold a degree in an S/E related field, you have a much better chance of earning more than $100,000.

Now we know that a higher degree in S/E or S/E related fields rewards you most financially. Another question follows: “Which one has a lower unemployment rate, and where should I get a job?”

What’s the unemployment rate for different degrees?

This graph clearly shows that higher degree levels have lower unemployment rates.
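A rate of this kind can be computed as the share of records with ESR == 3 within each degree level. The sketch below uses base R on made-up rows, so the numbers are not results from the census data. It follows the convention of the state-level code in this report, which divides by all degree holders rather than only by those in the labor force.

```r
# Made-up rows: degree level and employment status (3 = unemployed).
toy <- data.frame(
  DegLevel = c("Bachelor", "Bachelor", "Bachelor", "Bachelor",
               "MS", "MS", "MS", "MS",
               "Ph.D.", "Ph.D."),
  ESR      = c(1, 3, 3, 6,
               1, 1, 3, 6,
               1, 1)
)

# Mean of a logical vector is the fraction of TRUE values,
# so this gives the percentage of unemployed records per level.
rate <- tapply(toy$ESR == 3, toy$DegLevel, mean) * 100

print(rate)
```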

Unemployment rate in each state for different degree holders.

Now take a look at the unemployment data in different states. Depending on your degree level and degree type, you can choose which state to go to.

For a non-S/E degree holder, CA, NV, and NY have the highest unemployment rates.

For an S/E related degree holder, CO, OK, GA, and NY have the highest unemployment rates. These are generally lower than for non-S/E degree holders.

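# Number of S/E degree holders (SciEng == 2) in each state
stateTotalSE <- ds %>%
  filter(SciEng == 2) %>%
  group_by(ST) %>%
  summarise(count = n())

# Number of unemployed (ESR == 3) S/E degree holders in each state
jobLessSE <- ds %>%
  filter(SciEng == 2, ESR == 3) %>%
  group_by(ST) %>%
  summarise(count = n())

# Keep every state listed in stateCodes; states with no unemployed records get 0
jobLessSE <- right_join(jobLessSE, stateCodes, by = "ST")
jobLessSE$count[is.na(jobLessSE$count)] <- 0

# Look up each state's total by ST so the division does not depend on row order
jobLessSE <- mutate(jobLessSE,
                    value = count / stateTotalSE$count[match(ST, stateTotalSE$ST)] * 100)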

For an S/E degree holder, CA, OR, NY, and MI have the highest unemployment rates. These are a little higher than for S/E related degrees, but lower than for non-S/E degrees.

Final Plots and Summary

Plot One

Reason

This chart contains most of the information on the properties that interest me, and possibly anyone who wants to know the current status of degree holders. It shows all that information in a straightforward way. From this chart, readers can pick any topic they are interested in and find the result.

We can find many interesting things in this single graph:

  1. In every field, there are far fewer Ph.D.s than Bachelors and Masters.
  2. More people hold a non-S/E degree.
  3. Women make up a higher percentage in the non-S/E area.
  4. Men hold more S/E degrees.
  5. Women hold many more degrees in S/E related fields, which is reasonable since a large portion of science-related degrees are in nursing.

Plot Two

Reason

This chart answers the question of which type of degree has the highest financial reward. It shows not only the average but also the 25th and 75th percentiles, so we can reach a conclusion more easily.

From the graph above, we can clearly see that Science/Engineering degrees have a higher income at the same degree level. S/E related degrees have an income range similar to S/E degrees. Both are higher than non-S/E degrees. Another observation is that Ph.D. income is higher than MS income, and MS income is higher than Bachelor’s, in all three types of degree. So a Ph.D. in S/E is the most financially rewarding.

Now that I am thinking about getting a Ph.D. in an S/E field, where should I go to work?
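The average and the 25th and 75th percentiles behind such a comparison can be computed per group as sketched below. This is base R on made-up incomes, so the numbers are purely illustrative and are not results from the census data. The names toy, avg, and q are hypothetical.

```r
# Made-up incomes for two degree types.
toy <- data.frame(
  DType = rep(c("Other", "S&E"), each = 5),
  PINCP = c(20000, 30000, 40000, 50000, 60000,
            30000, 50000, 70000, 90000, 110000)
)

# Mean income per degree type.
avg <- tapply(toy$PINCP, toy$DType, mean)

# Quartiles per degree type: rows are "25%", "50%", "75%",
# columns are the degree types.
q <- sapply(split(toy$PINCP, toy$DType), quantile,
            probs = c(0.25, 0.5, 0.75))

print(avg)
print(q)
```

The same grouping can be extended to degree level by splitting on both columns.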

Plot Three

Reason

This graph gives people an easy way to figure out which state has the highest employment rate for S/E Ph.D.s. I chose this graph because I am pursuing a Ph.D. in the U.S. right now and want to see which state would give me a higher chance of getting employed once I have the degree. It is a narrowing-down process that answers my own question using the data.

From this graph, you can see that if you have a Ph.D. in S/E, you have a very high chance of getting a job in WY, SD, NE, KS, and OK. CA, OR, MI, IL, and NY might not be good choices.

Reflection

The project begins with a simple question about whether it is worth getting a Ph.D., and then compares different types of degrees to see which kind of Ph.D. is worth the most.

I chose to run the analysis of the US Census data in Kaggle Scripts. At first it went well; however, as the code grew, a single run took several minutes, which made debugging slow. The data is very clean, which saves a lot of cleaning time. For instance, when I select degree levels 21, 22, and 24 and draw the histogram of age, the youngest is 19, which is reasonable.
It does have many NAs: about 1/4 of the total population doesn’t have a Bachelor’s or higher degree.

Future work

  • One might look at how people without a college degree fare in society.
  • Try to answer why ND is different: it has a relatively very high unemployment rate for S/E Ph.D.s but a very low unemployment rate for S/E Bachelors and Masters.
  • Study specific types of degrees, such as an MS in Computer Science or a Ph.D. in physics.